PM 566 Assignment 1

Author

Dana Gonzalez

Assignment Details

In this assignment, I will use Environmental Protection Agency (EPA) air pollution data to determine whether daily concentrations of PM2.5 in California decreased from 2002 to 2022.

Step 1

Read CSV into Dataframe

The 2002 data set contains 15,976 observations (rows) of 22 variables (columns). The 2022 data set has the same 22 variables but 59,756 observations.

Code
Data_2002 <- read.csv("~/Desktop/PM 566/PM566-Labs/PM2.5_2002_Data.csv")
Data_2022 <- read.csv("~/Desktop/PM 566/PM566-Labs/PM2.5_2022_Data.csv")

Check Dimensions, Headers, Footers, Variable Names, and Variable Types

Again, this shows that the 2002 and 2022 data sets both have 22 variables (columns), and 15,976 and 59,756 observations (rows) of these variables, respectively.

Code
dim(Data_2002)
[1] 15976    22
Code
dim(Data_2022)
[1] 59756    22

There do not seem to be any obvious or clear irregularities at the top of the data for either year.

The same goes for the bottom of the data (although I did have to check whether Yolo County was real).

The str function allowed us to double-check the number of observations and variables in each data set (both matched the dim output above). It also showed each data set's variable names, variable types, and first few observations. Again, there don't seem to be any clear or obvious irregularities.

By using the summary function we are able to see various measures of central tendency, measures of spread, and other pieces of information for all 22 of our variables, for each year. Once again, there don’t seem to be any clear or obvious irregularities.
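The checks described above can be sketched on a small made-up data frame. This is only an illustration: Toy_Data and its values are hypothetical stand-ins for Data_2002 and Data_2022, and the County column name is an assumption.

```r
# Hypothetical toy data standing in for one year's data set (values are made up)
Toy_Data <- data.frame(
  Date = c("01/01/2002", "01/02/2002", "01/03/2002", "01/04/2002"),
  Daily.Mean.PM2.5.Concentration = c(12.1, 8.4, 25.0, 5.3),
  County = c("Alameda", "Alameda", "Yolo", "Yolo")
)

dim(Toy_Data)      # number of rows and columns
head(Toy_Data, 2)  # first rows, to spot problems at the top of the data
tail(Toy_Data, 2)  # last rows, to spot problems at the bottom of the data
str(Toy_Data)      # variable names, variable types, and a few values
summary(Toy_Data)  # per-column summaries (min, quartiles, mean, max)
```

Running the same five calls on each year's data frame gives the dimension, header, footer, and variable-type checks in one pass.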

Step 2

Combine 2002 and 2022 Data Into One Dataframe

Code
Combined_Data <- rbind(Data_2002, Data_2022)
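Because rbind on data frames expects matching column names, a quick sanity check before and after combining may be useful. The sketch below uses hypothetical one-row stand-ins (Toy_2002, Toy_2022) rather than the real data.

```r
# Hypothetical stand-ins for the two yearly data frames (values are made up)
Toy_2002 <- data.frame(Date = "01/01/2002", Daily.Mean.PM2.5.Concentration = 12.1)
Toy_2022 <- data.frame(Date = "01/01/2022", Daily.Mean.PM2.5.Concentration = 7.4)

# Confirm both data frames have the same columns in the same order
Same_Columns <- identical(names(Toy_2002), names(Toy_2022))

Toy_Combined <- rbind(Toy_2002, Toy_2022)

# The combined row count should equal the sum of the two inputs
Rows_Add_Up <- nrow(Toy_Combined) == nrow(Toy_2002) + nrow(Toy_2022)
```

For the data sets in this assignment, the combined data frame would be expected to have 15,976 + 59,756 = 75,732 rows.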

Create New Year Column

Code
Combined_Data$Date <- as.Date(Combined_Data$Date, format = "%m/%d/%Y")
Combined_Data$Year <- format(Combined_Data$Date, "%Y")
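As a small illustration of the date parsing above, the sketch below uses made-up date strings in the month/day/year layout that the format string assumes. Note that format returns the year as text, so a conversion may be needed depending on how Year is used later; the variable names here are hypothetical.

```r
# Hypothetical date strings in the month/day/year layout assumed above
Raw_Dates <- c("01/15/2002", "12/31/2022")
Parsed <- as.Date(Raw_Dates, format = "%m/%d/%Y")

Year_Chr <- format(Parsed, "%Y")  # character vector of four-digit years
Year_Int <- as.integer(Year_Chr)  # integer, if numeric comparisons are needed
Year_Fct <- factor(Year_Chr)      # factor, convenient for grouping in plots

# A string that does not match the format becomes NA instead of raising an error
Bad_Date <- as.Date("2002-01-15", format = "%m/%d/%Y")
```

Checking for NA values in the parsed dates is one way to confirm that every row matched the expected format.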

Rename Key Variables

Code
names(Combined_Data)[names(Combined_Data) == "Daily.Mean.PM2.5.Concentration"] <- "Daily_PM2.5"
names(Combined_Data)[names(Combined_Data) == "Daily.AQI.Value"] <- "Daily_AQI"

Step 3

Map of Collection Sites

Although the monitoring sites are spread throughout California, they seem to be more concentrated along the coast and in and around major cities (e.g., Los Angeles, San Francisco, San Jose, San Diego). There are also relatively few sites in Southeast California (the eastern regions of San Bernardino, Riverside, and Imperial counties).

Code
Sites <- unique(Combined_Data[, c("Site.Latitude", "Site.Longitude")])
dim(Sites)
[1] 202   2
Code
library(leaflet)
leaflet(Sites) |> 
  addProviderTiles('CartoDB.Positron') |> 
  addCircles(lat = ~Site.Latitude, lng = ~Site.Longitude,
             opacity = 1, fillOpacity = 1, radius = 400,  color = c('pink', 'red'))
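The map above passes two colors without linking them to a year, since Sites holds only coordinates. One possible way to color sites by year is sketched below. It assumes a Year column like the one created in Step 2; Toy_Combined, Site_Years, Year_Colors, and the coordinates are hypothetical, and the map is only drawn if leaflet is installed.

```r
# Hypothetical toy data standing in for Combined_Data (coordinates are made up)
Toy_Combined <- data.frame(
  Site.Latitude  = c(34.05, 34.05, 37.77, 32.72),
  Site.Longitude = c(-118.24, -118.24, -122.42, -117.16),
  Year           = c("2002", "2022", "2022", "2002")
)

# One row per site-year combination, so a site active in both years appears twice
Site_Years <- unique(Toy_Combined[, c("Site.Latitude", "Site.Longitude", "Year")])

# Named lookup from year to color
Year_Colors <- c("2002" = "pink", "2022" = "red")
Site_Years$Color <- unname(Year_Colors[Site_Years$Year])

# Draw the map only if leaflet is available
if (requireNamespace("leaflet", quietly = TRUE)) {
  library(leaflet)
  Site_Map <- leaflet(Site_Years) |>
    addProviderTiles("CartoDB.Positron") |>
    addCircles(lat = ~Site.Latitude, lng = ~Site.Longitude,
               color = ~Color, opacity = 1, fillOpacity = 1, radius = 400) |>
    addLegend(colors = unname(Year_Colors), labels = names(Year_Colors),
              title = "Year")
  Site_Map
}
```

With this approach, a site that reported in both years is drawn twice at the same location, so the later-drawn circle covers the earlier one; lowering fillOpacity is one way to keep both visible.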

Step 4

Based on some quick Google searches, most of these daily PM2.5 values seem plausible. Annual averages for California (specifically, Los Angeles) fall around 9 μg/m3, and daily averages can be as high as 35 μg/m3 for the same areas.

We see values much higher than this in our dataset (upwards of 50-69 μg/m3). Still, these values may be valid, as events like wildfires can drastically raise daily PM2.5 concentration averages (e.g., smoke from the 2018 Camp Fire reportedly led to a daily PM2.5 concentration of 263 μg/m3 in Sacramento, the highest ever recorded in California).
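A per-year summary is one way to put the high values in context. The sketch below is illustrative only: Toy is a hypothetical stand-in for Combined_Data with made-up values, and the 35 μg/m3 cutoff is simply the daily figure quoted above.

```r
# Hypothetical toy data standing in for Combined_Data (values are made up)
Toy <- data.frame(
  Year        = c("2002", "2002", "2022", "2022"),
  Daily_PM2.5 = c(10, 60, 5, 20)
)

# Summary statistics of daily PM2.5 within each year
tapply(Toy$Daily_PM2.5, Toy$Year, summary)

# Highest daily value in each year
Max_By_Year <- tapply(Toy$Daily_PM2.5, Toy$Year, max)

# Rows above the daily figure quoted above (35), for closer inspection
High_Days <- Toy[Toy$Daily_PM2.5 > 35, ]
```

Looking at the dates and sites of the rows in High_Days can help judge whether they line up with known events such as wildfires.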

We also see a number of negative values among our daily PM2.5 concentrations. After some more Google searches, I learned that this can occur in two main circumstances: either there is an issue with the measuring instrument, or the measurement is taken while the atmosphere is extremely clean (approaching 0 μg/m3) and there is some level of measurement noise.

After a quick skim of the data, I'm leaning towards the latter explanation for this data set's negative values, as the majority of them fall between -1.0 and 0 μg/m3.
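The skim of negative values could also be quantified. The sketch below is a minimal example on made-up numbers; Daily_PM2.5 here is a hypothetical vector standing in for the Daily_PM2.5 column of Combined_Data.

```r
# Hypothetical daily PM2.5 values (made up) standing in for the real column
Daily_PM2.5 <- c(12.3, -0.4, 5.1, -0.9, 0.0, -2.5, 30.2, 8.8)

# Keep only the negative readings
Negative <- Daily_PM2.5[Daily_PM2.5 < 0]

N_Negative     <- length(Negative)       # number of negative readings
Share_Negative <- mean(Daily_PM2.5 < 0)  # share of all readings that are negative
Share_Near_0   <- mean(Negative >= -1)   # share of negatives within 1 unit of zero
range(Negative)                          # most and least negative readings
```

A large Share_Near_0 would be consistent with the measurement-noise explanation, while many strongly negative readings would point more towards instrument problems.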

There do not seem to be any missing values for our variables of interest.
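One way to confirm the absence of missing values is to count NA entries per column. This is a sketch on made-up data: Toy_Combined is hypothetical, with one NA inserted deliberately so the check has something to find.

```r
# Hypothetical toy data with one deliberately missing value (values are made up)
Toy_Combined <- data.frame(
  Daily_PM2.5 = c(12.3, NA, 5.1),
  Daily_AQI   = c(52L, 40L, 21L),
  Year        = c("2002", "2002", "2022")
)

# Count missing values in each variable of interest
Key_Vars <- c("Daily_PM2.5", "Daily_AQI", "Year")
Missing_Counts <- colSums(is.na(Toy_Combined[, Key_Vars]))
Missing_Counts

# Quick yes/no check for a single column
anyNA(Toy_Combined$Daily_AQI)
```

On the real data, a vector of zeros from colSums would support the statement that there are no missing values in the variables of interest.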

Step 5

Exploratory Graphs

Code
library(ggplot2)